Mining Very Large Databases

نویسندگان

Venkatesh Ganti

Johannes Gehrke

Raghu Ramakrishnan

چکیده

38 Computer E stablished companies have had decades to accumulate masses of data about their customers , suppliers, and products and services. The rapid pace of e-commerce means that Web startups can become huge enterprises in months, not years, amassing proportionately large databases as they grow. Data mining, also known as knowledge discovery in databases, 1 gives organizations the tools to sift through these vast data stores to find the trends, patterns, and correlations that can guide strategic decision making. Traditionally, algorithms for data analysis assume that the input data contains relatively few records. Current databases, however, are much too large to be held in main memory. Retrieving data from disk is markedly slower than accessing data in RAM. Thus, to be efficient, the data-mining techniques applied to very large databases must be highly scalable. An algorithm is said to be scalable if—given a fixed amount of main memory—its runtime increases linearly with the number of records in the input database. Recent work has focused on scaling data-mining algorithms to very large data sets. In this survey, we describe a broad range of algorithms that address three classical data-mining problems: market basket analysis , clustering, and classification. A market basket is a collection of items purchased by a customer in an individual customer transaction, which is a well-defined business activity—for example , a customer's visit to a grocery store or an online purchase from a virtual store such as Amazon.com. Retailers accumulate huge collections of transactions by recording business activity over time. One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions. Each pattern extracted through the analysis consists of an itemset and the number of transactions that contain it. Businesses can use knowledge of these patterns to improve the placement of items in a store or the layout of mail-order catalog pages and Web pages. An itemset containing i items is called an i-itemset. The percentage of transactions that contain an itemset is called the itemset's support. For an itemset to be interesting, its support must be higher than a user-specified minimum; such itemsets are said to be frequent. Figure 1 shows three transactions stored in a rela-tional database system. The database has five fields: a transaction identifier, a customer identifier, the item purchased, its price, and the transaction date. The first transaction shows a customer who …

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Mining Very Large Databases Using Software Agents

Some databases are simply large (e.g., with ter_abytes of data). A number of potential and diverse patterns in databases would sufficiently be mined so as to support various decisions in applications. This research advocates a re-recognition and re-cogitation to how to discover very large databases. It also presents a new technique mining very large databases will be developed on different hier...

متن کامل

Mining Spatial Data Using An Interactive Rule-Based Approach

With the advent of very large spatial databases, it is beyond human capacity to examine and understand the information contained within such volumes of data directly. Although data mining has been recognized as a key means of finding patterns in large databases, general data mining methods alone are not sufficient for spatial data mining. The strengths of the computer’s ability to perform numer...

متن کامل

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases

Methods for efficient mining of frequent patterns have been studied extensively by many researchers. However, the previously proposed methods still encounter some performance bottlenecks when mining databases with different data characteristics, such as dense vs. sparse, long vs. short patterns, memory-based vs. disk-based, etc. In this study, we propose a simple and novel hyperlinked data stru...

متن کامل

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases

Frequent itemset mining is a classic problem in data mining. It is a non-supervised process which concerns in finding frequent patterns (or itemsets) hidden in large volumes of data in order to produce compact summaries or models of the database. These models are typically used to generate association rules, but recently they have also been used in far reaching domains like e-commerce and bio-i...

متن کامل

A Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis

Basically, medical diagnosis problems are the most effective component of treatment policies. Recently, significant advances have been formed in medical diagnosis fields using data mining techniques. Data mining or Knowledge Discovery is searching large databases to discover patterns and evaluate the probability of next occurrences. In this paper, Bayesian Classifier is used as a Non-linear dat...

متن کامل

An incremental mining algorithm for maintaining sequential patterns using pre-large sequences

Mining useful information and helpful knowledge from large databases has evolved into an important research area in recent years. Among the classes of knowledge derived, finding sequential patterns in temporal transaction databases is very important since it can help model customer behavior. In the past, researchers usually assumed databases were static to simplify data-mining problems. In real...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

IEEE Computer

دوره 32 شماره

صفحات -

تاریخ انتشار 1999

Mining Very Large Databases

نویسندگان

چکیده

منابع مشابه

Mining Very Large Databases Using Software Agents

Mining Spatial Data Using An Interactive Rule-Based Approach

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases

New Parallel Algorithms for Frequent Itemset Mining in Very Large Databases

A Probabilistic Bayesian Classifier Approach for Breast Cancer Diagnosis and Prognosis

An incremental mining algorithm for maintaining sequential patterns using pre-large sequences

عنوان ژورنال:

اشتراک گذاری